Abstract


Assessing the performance of a learned model is a crucial part of machine learning.
However, in some domains only positive and unlabeled examples are available,
which prohibits the use of most standard evaluation metrics. We propose an
approach to estimate any metric based on contingency tables, including ROC and
PR curves, using only positive and unlabeled data. Estimating these performance
metrics is essentially reduced to estimating the fraction of (latent) positives in the
unlabeled set, assuming known positives are a random sample of all positives. We
provide theoretical bounds on the quality of our estimates, illustrate the importance
of estimating the fraction of positives in the unlabeled set and demonstrate
empirically that we are able to reliably estimate ROC and PR curves on real data.

1 Introduction 

Model evaluation is a critical step in the learning process. Typically, evaluations either report summary
metrics, such as accuracy, F1 score, or area under the receiver operator characteristic (ROC)
curve, or visually show a model’s performance under different operating conditions by using ROC or
precision-recall curves. All the aforementioned evaluation approaches require constructing contingency
tables (also called confusion matrices), which show how a model’s predicted labels relate to
an example’s ground truth label. Computing a contingency table requires labeled examples. However,
for many problems only a few labeled examples and many unlabeled ones are available as
acquiring labels can be time-consuming, costly, unreliable, and in some cases impossible.

The field of semi-supervised learning [1] focuses on coping with partially labeled data. Positive and 
unlabeled (PU) learning is a special case of semi-supervised learning where each example’s label is 
either positive or not known [2, 3, 4, 5, 6, 7, 8]. Both semi-supervised and PU learning tend to focus 
on developing learning algorithms that cope with partially labeled data during training as opposed to 
evaluating algorithms when the test set is partially labeled. What is less well studied is the effect of 
partially labeled data on evaluation. Currently, algorithms are evaluated assuming that the test data 
is fully labeled [9, 10, 11, 12, 13, 7, 8] and if the test data is only partially labeled, sometimes it is 
assumed that all unlabeled instances are negative when evaluating performance [14, 15, 16]. 

This paper describes how to incorporate the unlabeled data in the model evaluation process. We
show how to compute contingency tables based on only positive and unlabeled examples where the
unlabeled set contains both positive and negative examples, by looking at the ranking of examples
produced by a model. Theoretically, we establish important relationships between contingency tables
and rank distributions, which allow us to provide bounds on the false positive rate at each rank
when the ranking contains examples whose ground truth label is unknown. Our findings have important
implications for model selection as we show that naively assuming that all unlabeled examples
are negative, as is sometimes done in PU learning, could lead to selecting the wrong model. We
demonstrate the efficacy of our approach by estimating ROC and PR curves from real-world data.


2 Background and definitions 

We first review the relevant background on model evaluation and issues caused by partial labeling. 


2.1 Rank distributions and contingency tables 

We focus on binary decision problems, where the goal is to classify examples as either positive 
or negative. Most learned models (e.g., SVM, logistic regression, naive Bayes) predict a numeric 
score for each example where higher values imply higher confidence that the instance belongs to the 
positive class. Typically, a ranking $\mathcal{R}$ is produced by sorting examples in descending order by their
numeric score such that confident positive predictions are ranked close to the top of $\mathcal{R}$.^1

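Within a ranking $\mathcal{R}$, we treat $\mathcal{P} \subset \mathcal{R}$ as the subset of examples with positive labels,
$\mathcal{N} = \mathcal{R} \setminus \mathcal{P}$ as the subset of examples with negative labels, and let $\mathrm{rank}(\mathcal{R}, x)$ denote
the rank of an instance $x$ in $\mathcal{R}$. Given a cutoff rank $r$, predictions can be made by assigning the
positive class to the $r$ top ranked instances and the negative class to the rest. This decision rule
yields a true positive rate (TPR), which is the fraction of positive examples that are correctly labeled
as positive, and false positive rate (FPR), which is the fraction of negative examples that are
incorrectly labeled as positive:

$\mathrm{TPR}(\mathcal{P}, r) = \Pr(\mathrm{rank}(\mathcal{R}, x) \le r \mid x \in \mathcal{P}) = |\{x \in \mathcal{P} : \mathrm{rank}(\mathcal{R}, x) \le r\}| \,/\, |\mathcal{P}|$, (1)

$\mathrm{FPR}(\mathcal{P}, r) = \Pr(\mathrm{rank}(\mathcal{R}, x) \le r \mid x \notin \mathcal{P}) = \mathrm{TPR}(\mathcal{R} \setminus \mathcal{P}, r)$. (2)

Given the number of positives $|\mathcal{P}|$ and negatives $|\mathcal{R} \setminus \mathcal{P}|$, the contingency table for a rank $r$ is:

$\mathrm{TP}(\mathcal{P}, r) = \mathrm{TPR}(\mathcal{P}, r) \cdot |\mathcal{P}|$, (3)
$\mathrm{FP}(\mathcal{P}, r) = \mathrm{FPR}(\mathcal{P}, r) \cdot |\mathcal{R} \setminus \mathcal{P}|$,
$\mathrm{FN}(\mathcal{P}, r) = |\mathcal{P}| - \mathrm{TP}(\mathcal{P}, r)$, (4)
$\mathrm{TN}(\mathcal{P}, r) = |\mathcal{R} \setminus \mathcal{P}| - \mathrm{FP}(\mathcal{P}, r)$. (5)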

The rank distribution of a set of instances $\mathcal{S}$ within an overall ranking $\mathcal{R}$ is defined as the distribution
of their corresponding ranks within $\mathcal{R}$. The rank cumulative distribution function (CDF) of a set of
instances $\mathcal{S}$ is defined as the (empirical) CDF of their ranks, i.e. $\forall r \in \{1, \ldots, |\mathcal{R}|\}$:

$F(\mathcal{S}, r) = \Pr(\mathrm{rank}(\mathcal{R}, x) \le r \mid x \in \mathcal{S})$. (6)

The concept of rank CDF is illustrated in Figure 1. Note that $F(\mathcal{P}, r) = \mathrm{TPR}(\mathcal{P}, r)$ (Equations (1)
and (6)), that is, the rank CDF of the set of positives $\mathcal{P}$ at rank $r$ in an overall ranking $\mathcal{R}$ can be
interpreted directly as a true positive rate, when labeling the $r$ top ranked instances as positive.

Figure 1: Rank CDF of two sets of positives $\mathcal{P}_1 = \{B, D, A, C\}$ and $\mathcal{P}_2 = \{E, G, F\}$ within an overall
ranking $\mathcal{R} = \{B, E, D, G, H, A, F, C, I\}$, with $|\mathcal{P}_1| = 4$ and $|\mathcal{P}_2| = 3$. In practice $\mathcal{R}$ is obtained
by sorting the data according to classifier score. The rank CDF of a set $\mathcal{S} \subset \mathcal{R}$ is based on the
positions of elements of $\mathcal{S}$ in $\mathcal{R}$.


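We use two convenience functions to partition a set of instances $\mathcal{X}$ by rank:

$\mathrm{head}(\mathcal{X}, r) = \{x \in \mathcal{X} : \mathrm{rank}(\mathcal{R}, x) \le r\}$ and $\mathrm{tail}(\mathcal{X}, r) = \{x \in \mathcal{X} : \mathrm{rank}(\mathcal{R}, x) > r\}$,
such that $\mathrm{head}(\mathcal{X}, r) \cup \mathrm{tail}(\mathcal{X}, r) = \mathcal{X}$ and $|\mathrm{head}(\mathcal{X}, r)| = F(\mathcal{X}, r) \cdot |\mathcal{X}|$.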
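The definitions above can be made concrete with a small sketch. It is illustrative only: the function names `rank_cdf` and `contingency_table` are hypothetical, ranks are taken to be 1-based, and the ranking is the toy example from Figure 1 (assuming its last element is I).

```python
from typing import Hashable, Sequence, Set, Tuple


def rank_cdf(ranking: Sequence[Hashable], subset: Set[Hashable], r: int) -> float:
    """Empirical rank CDF F(S, r): fraction of `subset` with rank <= r (Eq. 6)."""
    in_head = sum(1 for x in ranking[:r] if x in subset)  # |head(S, r)|
    return in_head / len(subset)


def contingency_table(ranking: Sequence[Hashable], positives: Set[Hashable],
                      r: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) when the r top ranked instances are labeled positive."""
    tp = sum(1 for x in ranking[:r] if x in positives)
    fp = r - tp                               # remaining predicted positives
    fn = len(positives) - tp                  # positives below the cutoff
    tn = (len(ranking) - len(positives)) - fp
    return tp, fp, fn, tn


ranking = list("BEDGHAFCI")   # overall ranking R, best score first
p1 = set("BDAC")              # P_1 from Figure 1
p2 = set("EGF")               # P_2 from Figure 1
```

With cutoff rank 4, `rank_cdf(ranking, p1, 4)` is 0.5 and `contingency_table(ranking, p1, 4)` is `(2, 2, 2, 3)`, so the FPR of treating P_1 as the full positive set is 2/5.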

2.2 ROC and PR curves 

Receiver operator characteristic (ROC) curves are used extensively for evaluating classifiers in machine
learning [17] as they illustrate the performance of a model over its entire operating range.
ROC curves depict how a model’s true positive rate (shown on the y-axis) varies as a function of
its false positive rate (shown on the x-axis). Each cutoff rank $r \in \{1, \ldots, |\mathcal{R}|\}$ corresponds to a
single point (i.e., (FPR, TPR) pair) in ROC space (Eqs. (1) and (2)). An (empirical) ROC curve for
a ranking $\mathcal{R}$ and set of positives $\mathcal{P} \subset \mathcal{R}$ is constructed by computing $\mathrm{FPR}(\mathcal{P}, r)$ and $\mathrm{TPR}(\mathcal{P}, r)$ at
each rank $r$ and interpolating by drawing a straight line between points corresponding to consecutive
ranks. The area under an ROC curve (AUROC) is a commonly used summary statistic, typically
ranging between 0.5 (random model) and 1 (perfect model). AUROC is a popular criterion in model
selection and is often used as the optimization objective in hyperparameter search [17].

^1 Which means a low value for rank in this work, though this is often referred to as highly ranked in the literature.
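As a minimal sketch of the construction just described, the snippet below walks down a fully labeled ranking, emits one (FPR, TPR) point per cutoff rank and integrates with the trapezoidal rule. The function names are hypothetical, and adding the origin for r = 0 is an assumption made here so that the curve starts at (0, 0).

```python
from typing import Hashable, List, Sequence, Set, Tuple


def roc_points(ranking: Sequence[Hashable], positives: Set[Hashable]) -> List[Tuple[float, float]]:
    """(FPR, TPR) for every cutoff rank r = 0..|R|; r = 0 gives the origin."""
    n_pos = len(positives)
    n_neg = len(ranking) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for x in ranking:                 # moving the cutoff one rank down
        if x in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auroc(points: List[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through `points` (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy ranking from Figure 1 with P_1 as the set of positives.
example_auc = auroc(roc_points(list("BEDGHAFCI"), set("BDAC")))
```

For this toy ranking the area is 0.6, which equals the fraction of (positive, negative) pairs in which the positive is ranked first (12 of 20).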

Precision-Recall (PR) curves [18] are an alternative to ROC curves that show how a model’s precision
(y-axis) varies as a function of recall (x-axis). Recall is equivalent to TPR and precision is the
fraction of examples classified as positive that are truly positive (TP / (TP + FP)). PR curves are
widely used when there is a skew in the class distributions [19, 8].

2.3 Evaluation with partially labeled data 

In the partial labeling setting, $\mathcal{R}$ consists of disjoint sets of known positives $\mathcal{P}_L$, known negatives $\mathcal{N}_L$
and unlabeled instances $\mathcal{U}$. The unlabeled set $\mathcal{U}$ consists of latent positives $\mathcal{P}_U$ and latent negatives.
The fraction of latent positives in the unlabeled set, denoted by $\beta$, plays a crucial role in our work:

$\beta = \Pr(x \in \mathcal{P}_U \mid x \in \mathcal{U}) = |\mathcal{P}_U| \,/\, |\mathcal{U}|$. (7)

Note that computing contingency tables requires fully labeled data. If only a few labeled instances
of both classes are available, they can be used to compute rough estimates of predictive performance.
However, if only positive labels are available, even a rough approximation of common metrics cannot
be estimated directly as we do not know which unlabeled examples are positive and which
are negative. A common approach to evaluate models in a PU learning context is to treat the full
unlabeled set as negative [14, 15, 16], though we will show that this may lead to spurious results.

3 Relationship between the rank CDF of positives and contingency tables 

The challenge of incorporating unlabeled data into an evaluation metric is knowing which unlabeled
examples are latent positives and which are latent negatives. Our insight is that, if the known positives
are sampled completely at random from all positives, the rank distribution of latent positives
should follow the rank distribution of known positives. Thus if we know $\beta$, which is needed to
compute the expected number of latent positives within the unlabeled data, this provides an avenue
for building contingency tables that incorporate the unlabeled data. To do so, we first prove relationships
between rank CDFs of sets of positives within an overall ranking at a given rank $r$ and the
corresponding contingency tables. Then, we use these relationships to prove bounds on the FPR at
a given rank $r$ when the ranking includes unlabeled examples, some of which are latent positives.

3.1 Rank distributions and contingency tables based on subsets of positives within a ranking 

We begin by considering given sets of positives within an overall ranking. Proofs of all lemmas can 
be found in Appendix A, along with figures to illustrate the associated property. 

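Lemma 1. Given a rank $r$ and two disjoint subsets of positives $\mathcal{P}_1$ and $\mathcal{P}_2$ within an overall ranking
$\mathcal{R}$, if $|\mathcal{P}_1| = |\mathcal{P}_2|$ and $\mathrm{TPR}(\mathcal{P}_1, r) > \mathrm{TPR}(\mathcal{P}_2, r)$, then $\mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$.

Lemma 2. Given a rank $r$ and two disjoint sets of positives $\mathcal{P}_1$, $\mathcal{P}_2$ in a ranking $\mathcal{R}$ and $\mathcal{P}_\Omega = \mathcal{P}_1 \cup \mathcal{P}_2$,
if $\mathrm{TPR}(\mathcal{P}_1, r) < \mathrm{TPR}(\mathcal{P}_2, r)$ then $\mathrm{TPR}(\mathcal{P}_1, r) < \mathrm{TPR}(\mathcal{P}_\Omega, r) < \mathrm{TPR}(\mathcal{P}_2, r)$.

Corollary 1. Given a rank $r$ and three sets of positives $\mathcal{P}_A$, $\mathcal{P}_B$ and $\mathcal{P}_C$ within a ranking $\mathcal{R}$ such
that $\mathcal{P}_A \cap \mathcal{P}_B = \emptyset$ and $\mathcal{P}_A \cap \mathcal{P}_C = \emptyset$ and $|\mathcal{P}_B| = |\mathcal{P}_C|$, then

$\mathrm{TPR}(\mathcal{P}_B, r) < \mathrm{TPR}(\mathcal{P}_C, r) \Leftrightarrow \mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_B, r) < \mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_C, r)$.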
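As a small worked illustration of Lemma 2 (not part of the original argument), take the toy ranking of Figure 1 with cutoff rank 4. Two of the four elements of the first set and two of the three elements of the second set are ranked in the top four, and the TPR of their union is the size-weighted average of the two:

```latex
\mathrm{TPR}(\mathcal{P}_1, 4) = \tfrac{2}{4}, \qquad
\mathrm{TPR}(\mathcal{P}_2, 4) = \tfrac{2}{3}, \qquad
\mathrm{TPR}(\mathcal{P}_1 \cup \mathcal{P}_2, 4)
  = \frac{\tfrac{2}{4} \cdot 4 + \tfrac{2}{3} \cdot 3}{4 + 3} = \tfrac{4}{7},
\qquad \tfrac{1}{2} < \tfrac{4}{7} < \tfrac{2}{3}.
```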

3.2 Contingency tables based on partially labeled data 

Lemmas 1 and 2 describe relationships between rank distributions and contingency tables of different
(but known) sets of positives within an overall ranking. We now show how to construct
contingency tables corresponding to the greatest-lower and least-upper bound of the FPR at a given
rank, accounting for the unknown set of latent positive examples from partially labeled data, given $\beta$.

Theorem 1. Consider an overall ranking $\mathcal{R}$ consisting of disjoint sets of known positives $\mathcal{P}_L$, known
negatives $\mathcal{N}_L$ and unlabeled instances $\mathcal{U}$, where $\mathcal{U}$ contains an unknown set of latent positives $\mathcal{P}_U \subset \mathcal{U}$
of known size $|\mathcal{P}_U| = \beta \cdot |\mathcal{U}|$. Given a rank $r$ and an upper bound $T_{ub}(r) \ge \mathrm{TPR}(\mathcal{P}_U, r)$, a tight
lower bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$ with $\mathcal{P}_\Omega = \mathcal{P}_L \cup \mathcal{P}_U$ can be found without explicitly identifying $\mathcal{P}_U$.

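Proof: Step 1: assign a set of surrogate positives $\mathcal{P}_U^*$:^2

$\mathcal{P}_U^* = \arg\min_{\mathcal{P}^* \subset \mathcal{U}} \mathrm{TPR}(\mathcal{P}^*, r)$ subject to $\mathrm{TPR}(\mathcal{P}^*, r) \ge T_{ub}(r)$ and $|\mathcal{P}^*| = \beta \cdot |\mathcal{U}|$, (8)

then $\mathrm{TPR}(\mathcal{P}_U^*, r) \ge \mathrm{TPR}(\mathcal{P}_U, r)$ by construction. If $|\mathrm{head}(\mathcal{U}, r)| < \beta \, T_{ub}(r) \cdot |\mathcal{U}|$, then no $\mathcal{P}_U^*$
exists that satisfies the constraint $\mathrm{TPR}(\mathcal{P}_U^*, r) \ge T_{ub}(r)$ in Equation (8).^3 In this case, treat all
instances in $\mathrm{head}(\mathcal{U}, r)$ as surrogate positive, which trivially implies $\mathrm{TPR}(\mathcal{P}_U^*, r) \ge \mathrm{TPR}(\mathcal{P}_U, r)$.

Step 2: define $\mathcal{P}_\Omega^* = \mathcal{P}_L \cup \mathcal{P}_U^*$. Using Corollary 1 yields $\mathrm{TPR}(\mathcal{P}_\Omega^*, r) \ge \mathrm{TPR}(\mathcal{P}_\Omega, r)$. Since
$|\mathcal{P}_\Omega^*| = |\mathcal{P}_\Omega|$, using Lemma 1 yields the lower bound on FPR, i.e., $\mathrm{FPR}(\mathcal{P}_\Omega^*, r) \le \mathrm{FPR}(\mathcal{P}_\Omega, r)$. ■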

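Applying Theorem 1 yields a nontrivial lower bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$. In Lemma 3 we prove that
$\mathrm{FPR}(\mathcal{P}_\Omega^*, r)$ is the greatest achievable lower bound based on a given $\mathcal{U} \subset \mathcal{R}$.

Lemma 3. Minimizing $\mathrm{TPR}(\mathcal{P}_U^*, r)$ in Equation (8) of Theorem 1 ensures $\mathrm{FPR}(\mathcal{P}_\Omega^*, r)$ is the greatest
achievable lower bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$ given $\beta$, $T_{ub}(r)$, $\mathcal{R}$ and $\mathcal{U}$.

Due to its symmetry, Theorem 1 can also be used to obtain the least achievable upper bound of
$\mathrm{FPR}(\mathcal{P}_\Omega, r)$ given a ranking $\mathcal{R}$ and a bound $T_{lb}(r) \le \mathrm{TPR}(\mathcal{P}_U, r)$ by assigning $\mathcal{P}_U^*$ such that:

$\mathcal{P}_U^* = \arg\max_{\mathcal{P}^* \subset \mathcal{U}} \mathrm{TPR}(\mathcal{P}^*, r)$ subject to $\mathrm{TPR}(\mathcal{P}^*, r) \le T_{lb}(r)$ and $|\mathcal{P}^*| = \beta \cdot |\mathcal{U}|$. (9)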

4 Efficiently computing the bounds 

We now describe how to use Theorem 1 and Lemma 3 to compute the contingency tables corresponding
to the greatest lower and least upper bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$ from a finite sample. First, we
explain how to compute contingency tables efficiently via Theorem 1. Second, we propose how to
obtain the bounds on rank CDF ($T_{lb}(r)$ and $T_{ub}(r)$) that are needed to build the contingency table.

4.1 Computing the contingency table with greatest-lower bound on FPR at given rank r

Given $\hat{\beta}$ (an estimate of $\beta$), $\mathcal{R}$ and the sets $\mathcal{P}_L$, $\mathcal{N}_L$, and $\mathcal{U}$, Theorem 1 enables computing
contingency tables corresponding to the least upper and greatest lower bound on FPR at a given cutoff
rank $r$. We focus on building the contingency table corresponding to the lower bound on the FPR;
the other is analogous.

We decompose the computation to consider the labeled and unlabeled instances separately: 


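$\begin{bmatrix} \mathrm{TP}^r & \mathrm{FP}^r \\ \mathrm{FN}^r & \mathrm{TN}^r \end{bmatrix}
= \begin{bmatrix} \mathrm{TP}_L^r = |\mathrm{head}(\mathcal{P}_L, r)| & \mathrm{FP}_L^r = |\mathrm{head}(\mathcal{N}_L, r)| \\ \mathrm{FN}_L^r = |\mathrm{tail}(\mathcal{P}_L, r)| & \mathrm{TN}_L^r = |\mathrm{tail}(\mathcal{N}_L, r)| \end{bmatrix}
+ \begin{bmatrix} \mathrm{TP}_U^r & \mathrm{FP}_U^r \\ \mathrm{FN}_U^r & \mathrm{TN}_U^r \end{bmatrix}$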

Given that at rank $r$ we can directly compute partial contingency tables for the labeled data based
on $\mathcal{R}$, $\mathcal{P}_L$ and $\mathcal{N}_L$, we focus on computing the contingency table for the unlabeled instances.

Given $T_{ub}(r)$, we can use Theorem 1 to determine the values in the contingency table for the unlabeled
instances for the greatest lower bound on FPR. Doing so requires inferring a set of surrogate
positives $\mathcal{P}_U^*$ from the unlabeled data, which must be a solution to Equation (8). This requires $\theta$
surrogate positives ranked at or above the cutoff, i.e. $|\mathrm{head}(\mathcal{P}_U^*, r)| = \theta$, and the rest in the tail,
where $\theta$ is defined as:

$\theta = \lceil T_{ub}(r) \cdot |\mathcal{P}_U^*| \rceil = \lceil T_{ub}(r) \cdot \hat{\beta} \cdot |\mathcal{U}| \rceil$. (10)

By rounding up in Equation (10), we ensure that $\mathrm{TPR}(\mathcal{P}_U^*, r) \ge T_{ub}(r)$ as required by Theorem 1.

^2 A surrogate positive is an example that we treat as if its ground truth label is positive (even though in reality
its ground truth label is unknown) when constructing a contingency table.

^3 An infeasibility implies that $T_{ub}(r)$ and/or $\beta$ are too high.

In practice, two corner cases must be considered. The first is $|\mathrm{head}(\mathcal{U}, r)| < \theta$, in which case it is
impossible to assign $\theta$ surrogates at or above rank $r$ in $\mathcal{U}$. In this case, all of $\mathrm{head}(\mathcal{U}, r)$ is assigned as
surrogate positives and the remaining surrogates are in $\mathrm{tail}(\mathcal{U}, r)$ (as discussed in Theorem 1). The second is
$|\mathrm{tail}(\mathcal{U}, r)| < |\mathcal{P}_U^*| - \theta$, in which case all of $\mathrm{tail}(\mathcal{U}, r)$ is labeled positive and the remaining surrogate
positives inevitably end up in $\mathrm{head}(\mathcal{U}, r)$. Hence, any set of surrogate positives $\mathcal{P}_U^*$ that meets the
following criteria solves Equation (8) and thus yields a valid bound:

$|\mathcal{P}_U^*| = \hat{\beta} \cdot |\mathcal{U}|$ and
$|\mathrm{head}(\mathcal{P}_U^*, r)| = \begin{cases} \min(|\mathrm{head}(\mathcal{U}, r)|, \theta) & \text{if } |\mathcal{P}_U^*| - \theta \le |\mathrm{tail}(\mathcal{U}, r)|, \\ |\mathcal{P}_U^*| - |\mathrm{tail}(\mathcal{U}, r)| & \text{if } |\mathcal{P}_U^*| - \theta > |\mathrm{tail}(\mathcal{U}, r)|. \end{cases}$ (11)


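Given a set of surrogate positives $\mathcal{P}_U^*$, the partial contingency table of interest becomes:

$\begin{bmatrix} \mathrm{TP}_U^r & \mathrm{FP}_U^r \\ \mathrm{FN}_U^r & \mathrm{TN}_U^r \end{bmatrix}
= \begin{bmatrix} |\mathrm{head}(\mathcal{P}_U^*, r)| & |\mathrm{head}(\mathcal{U} \setminus \mathcal{P}_U^*, r)| \\ |\mathrm{tail}(\mathcal{P}_U^*, r)| & |\mathrm{tail}(\mathcal{U} \setminus \mathcal{P}_U^*, r)| \end{bmatrix}$, (12)

where $\mathcal{U} \setminus \mathcal{P}_U^*$ is the set of surrogate negatives and $|\mathcal{P}_U^*|$ and $|\mathrm{head}(\mathcal{P}_U^*, r)|$ are known via Eq. 11.

Note that computing the partial contingency table for the unlabeled data can be done very efficiently
since it only requires set sizes as shown in Equation 12, without explicitly partitioning the unlabeled
set $\mathcal{U}$. That is, we do not need to know which examples are in $\mathrm{head}(\mathcal{P}_U^*, r)$, $\mathrm{tail}(\mathcal{P}_U^*, r)$,
$\mathrm{head}(\mathcal{U} \setminus \mathcal{P}_U^*, r)$ and $\mathrm{tail}(\mathcal{U} \setminus \mathcal{P}_U^*, r)$; we just need to know the number of examples each set contains.

The contingency table with least upper bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$ is obtained by replacing Eq. (10) by:

$\theta = \lfloor T_{lb}(r) \cdot |\mathcal{P}_U^*| \rfloor = \lfloor T_{lb}(r) \cdot \hat{\beta} \cdot |\mathcal{U}| \rfloor$. (13)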
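The computation in Equations (10) to (13) only needs set sizes, which the following sketch tries to make explicit. It is one possible reading of those equations rather than a reference implementation: the function name `unlabeled_table` and the flag `lower_fpr` are hypothetical, and rounding the number of surrogate positives to an integer is an assumption made here.

```python
import math
from typing import Tuple


def unlabeled_table(head_u: int, tail_u: int, beta_hat: float, t_bound: float,
                    lower_fpr: bool = True) -> Tuple[int, int, int, int]:
    """Partial contingency table (TP, FP, FN, TN) of the unlabeled set at one rank.

    head_u, tail_u: |head(U, r)| and |tail(U, r)|.
    t_bound: T_ub(r) if lower_fpr is True (Eq. 10), otherwise T_lb(r) (Eq. 13).
    """
    n_u = head_u + tail_u
    n_surrogate = round(beta_hat * n_u)       # |P_U^*|, rounded (assumption)
    raw = t_bound * n_surrogate
    theta = math.ceil(raw) if lower_fpr else math.floor(raw)
    # Eq. (11): clip theta to what head(U, r) and tail(U, r) can hold.
    if n_surrogate - theta <= tail_u:
        tp = min(head_u, theta)
    else:
        tp = n_surrogate - tail_u
    fn = n_surrogate - tp                     # surrogate positives in the tail
    fp = head_u - tp                          # surrogate negatives in the head
    tn = tail_u - fn                          # surrogate negatives in the tail
    return tp, fp, fn, tn


lower = unlabeled_table(300, 700, beta_hat=0.25, t_bound=0.5)                    # uses T_ub
upper = unlabeled_table(300, 700, beta_hat=0.25, t_bound=0.25, lower_fpr=False)  # uses T_lb
```

Adding the result entrywise to the labeled partial table gives the full contingency table at that rank, from which FPR, TPR or precision follow directly.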


4.2 Bounds on the rank distribution of $\mathcal{P}_U$

Applying Theorem 1 to build a contingency table at rank $r$ requires a bound $T_{ub}(r) \ge \mathrm{TPR}(\mathcal{P}_U, r)$
for estimating a lower bound on the FPR and a bound $T_{lb}(r) \le \mathrm{TPR}(\mathcal{P}_U, r)$ for estimating an upper
bound on the FPR. To compute these bounds, we assume known and latent positives have similar
rank distributions. This holds when known positives $\mathcal{P}_L$ are selected completely at random from all
positives $\mathcal{P}_\Omega$, but is violated if the process of selecting examples for labeling is biased [11].

$\mathrm{TPR}(\mathcal{P}_\Omega, r)$ is estimated via the empirical rank CDF of $\mathcal{P}_L$, which only approximates the true CDF.
To account for uncertainty, we construct confidence intervals (CIs) for the CDF. Our assumption
implies that a CI of the CDF based on $\mathcal{P}_L$ is also a CI of the CDF of $\mathcal{P}_U$. A CI boundary is treated
as a function mapping rank $r$ to the estimated bound on the CDF. $T_{lb}$ and $T_{ub}$ denote these bounds:

$0 \le T_{lb}(r) \le \mathrm{TPR}(\mathcal{P}_L, r), \mathrm{TPR}(\mathcal{P}_U, r), \mathrm{TPR}(\mathcal{P}_\Omega, r) \le T_{ub}(r) \le 1, \quad \forall r$. (14)

We formalize the bounds of the CI of the CDF as functions of rank because an underlying set with
that rank distribution does not necessarily exist in the overall ranking $\mathcal{R}$.

The confidence band on rank CDF can be computed based on the known positives in several ways.
We use a standard bootstrap approach [20] in our experiments. Having many known positives yields
a tight confidence band on rank CDF, which then translates to tight bounds on performance metrics.
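One way such a band could be obtained is sketched below. The text only states that a standard bootstrap was used, so the specific choice here, a pointwise percentile bootstrap over the ranks of the known positives, is an assumption, as are the function name `bootstrap_cdf_band` and its parameters.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_cdf_band(pos_ranks: Sequence[int], n_total: int, n_boot: int = 1000,
                       alpha: float = 0.05, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Pointwise percentile-bootstrap band (T_lb, T_ub) on the rank CDF.

    pos_ranks: 1-based ranks of the known positives; n_total: |R|.
    Both returned lists are indexed by r - 1 for r = 1..n_total.
    """
    rng = random.Random(seed)
    n = len(pos_ranks)
    curves = []
    for _ in range(n_boot):
        counts = [0] * (n_total + 1)
        for rank in rng.choices(pos_ranks, k=n):   # resample with replacement
            counts[rank] += 1
        cdf, running = [], 0
        for r in range(1, n_total + 1):
            running += counts[r]
            cdf.append(running / n)                # resampled rank CDF at r
        curves.append(cdf)
    lo_idx = int((alpha / 2) * (n_boot - 1))
    hi_idx = int((1 - alpha / 2) * (n_boot - 1))
    t_lb, t_ub = [], []
    for r in range(n_total):
        column = sorted(curve[r] for curve in curves)
        t_lb.append(column[lo_idx])
        t_ub.append(column[hi_idx])
    return t_lb, t_ub
```

A pointwise band is simpler than a simultaneous one but gives no joint coverage guarantee across all ranks; whether that matters depends on how the band is used downstream.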


5 Constructing ROC and PR curve estimates 

Next, we describe how to estimate bounds on the true ROC and PR curves. Though we focus on 
these two criteria, our approach can be used to estimate any metric based on contingency tables. 

ROC curves. Given a ranking, instead of constructing a single ROC curve, our approach computes
two curves: one corresponding to the upper bound and one corresponding to the lower bound on the
CI on rank CDF of known positives $\mathcal{P}_L$, using the methodology outlined in Section 4 to compute
two contingency tables for each rank $r$, corresponding to the greatest lower and least upper bound
on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$. The set of contingency tables corresponding to greatest lower bounds on FPR at
each rank form an upper bound on the ROC curve of all positives $\mathcal{P}_\Omega$, whereas the set of contingency
tables corresponding to the least upper bound on FPR form a lower bound on the ROC curve of $\mathcal{P}_\Omega$.

It is important to understand how these estimates correspond to bounds in ROC space. By computing
$\theta$ as in Equation (10) to obtain the greatest lower bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$, the corresponding TPR is
higher than $\mathrm{TPR}(\mathcal{P}_\Omega, r)$. As such, the upper bound on the ROC curve is shifted upwards and to the
left. Conversely, the lower bound on the ROC curve (based on the least upper bound on FPR at each
rank, i.e. $\theta$ as in Equation (13)) is shifted downward and to the right. This implies that the upper
bound on the ROC curve completely dominates the curve of $\mathcal{P}_\Omega$ and the lower bound is completely
dominated by the curve of $\mathcal{P}_\Omega$, provided that $T_{lb}(r) \le \mathrm{TPR}(\mathcal{P}_U, r) \le T_{ub}(r)$, $\forall r \in \{1, \ldots, |\mathcal{R}|\}$.

Convergence properties. The convergence properties of our bounds are contingent on those of (a CI
on) the empirical CDF; via the strong law of large numbers the empirical CDF $F_n(x)$ is a consistent
pointwise estimator of the true CDF $F(x)$, converging uniformly for increasing $n$ [21].

Figure 2 shows the convergence of the bounds on area under the curve for the estimated lower and
upper bound of the ROC curve for increasing amounts of known positives in simulated rankings.
The range of bounds depends on the width of the CI on rank CDF, which in turn depends on the
number of known positives (higher is better) and the size of the total data set (lower is better).

PR curves. Given the contingency tables used to generate the least upper bound and greatest lower
bound ROC curves, it is straightforward to construct the corresponding bounds in PR space. Each 
contingency table contains all the required information for generating a point in PR space. 

A key result relating ROC and PR curves is that one curve dominates another in ROC space if and 
only if it also dominates in PR space [18]. Given this result, mapping the bounds we obtain for 
ROC curves to PR space directly yields (tight) bounds on the corresponding true PR curve. Since 
the upper bound in ROC space completely dominates the true curve, and the lower bound in ROC 
space is completely dominated by it, the same holds for the bounds on PR curves. 

6 Discussion and Recommendations 

Next, we discuss several issues related to using our approach in practice. 

6.1 Determining $\beta$ and its effect

Our approach requires having an estimate $\hat{\beta}$ of $\beta$. There are many problems where $\beta$ is known from
domain knowledge (e.g., calculated and published based on a data source you do not have access
to), but explicit negatives are scarce or unavailable in the data under analysis. A real-world example
where this is true is the task of predicting whether someone has diabetes from health insurance data
[22]. In this context, some individuals are coded as having diabetes, but many diabetics are undiagnosed
and hence it is wrong to assume that all unlabeled patients do not have diabetes. However, the
incidence rate of diabetes is known and published in the medical literature. This type of situation
characterizes many medical problems. If $\beta$ is not known from domain knowledge, then it could be
estimated from data [5, 6, 23].

In either case, if $\hat{\beta}$ is not exact, the conditions of Lemma 1 are potentially violated where it is used
within Theorem 1. The effect of set size on FPR is characterized in Lemma 4, which will help us
understand the effect of over- or underestimating $\beta$.

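Lemma 4. Given two sets of positive labels $\mathcal{P}_1$ and $\mathcal{P}_2$ within an overall ranking $\mathcal{R}$ and a rank $r$,
such that $\mathrm{TPR}(\mathcal{P}_1, r) = \mathrm{TPR}(\mathcal{P}_2, r) = t$ and $|\mathcal{P}_1| > |\mathcal{P}_2|$, then:

(a) $\mathrm{FPR}(\mathcal{P}_2, r) < t \Rightarrow \mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$,

(b) $\mathrm{FPR}(\mathcal{P}_2, r) > t \Rightarrow \mathrm{FPR}(\mathcal{P}_1, r) > \mathrm{FPR}(\mathcal{P}_2, r)$.

(a) corresponds to a ranking and cutoff that is better than random (i.e. $\mathrm{TPR}(\mathcal{P}, r) > \mathrm{FPR}(\mathcal{P}, r)$).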

Lemma 4 has a large practical impact. If the ranking is better than random, then over- and
underestimating $\beta$ is useful to obtain a (loose) upper/lower bound on performance curves, respectively.
In other words, given bounds or a CI on $\beta$, that is $\beta_{lo} \le \beta \le \beta_{up}$, we can use $\beta_{lo}$ and $\beta_{up}$
to estimate a lower and upper bound on the true ROC or PR curve. Bounds computed based on
a CI for $\beta$ constitute a CI for the performance metric (at the same confidence level), assuming
the rank CDF of $\mathcal{P}_U$ is contained by the confidence band on the rank CDF. Tighter bounds on $\beta$
translate directly to tighter bounds on performance estimates. Finally, treating the full unlabeled set
as negative underestimates the true performance, since $\hat{\beta} = 0 \le \beta$. The effect of varying $\hat{\beta}$ is shown
in Figure 3.
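A rough numerical illustration of this effect can be built from Equation (16) in Appendix A, which expresses FPR at rank r through TPR and the number of positives. The sketch below holds TPR fixed, since in this approach it comes from the rank CDF, and varies only the assumed fraction of latent positives. The function names and the numbers are made up for illustration.

```python
def fpr_at_rank(r: int, tpr: float, n_pos: float, n_total: int) -> float:
    """FPR at cutoff rank r via Eq. (16): (r - TPR * |P|) / (|R| - |P|)."""
    return (r - tpr * n_pos) / (n_total - n_pos)


def fpr_for_beta(beta_hat: float, r: int, tpr: float,
                 n_known_pos: int, n_unlabeled: int) -> float:
    """FPR implied by assuming a fraction beta_hat of the unlabeled set is positive."""
    n_pos = n_known_pos + beta_hat * n_unlabeled   # |P_L| + beta_hat * |U|
    return fpr_at_rank(r, tpr, n_pos, n_known_pos + n_unlabeled)


# Hypothetical setting: 50 known positives, 950 unlabeled, cutoff 300, TPR 0.8.
for beta_hat in (0.0, 0.2, 0.3):
    print(beta_hat, fpr_for_beta(beta_hat, r=300, tpr=0.8,
                                 n_known_pos=50, n_unlabeled=950))
```

In this better-than-random setting the implied FPR shrinks as the assumed fraction grows, so assuming no latent positives gives the most pessimistic point in ROC space.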
Figure 2: The effect of $|\mathcal{P}_L|$ on estimated AUC. Based on
$|\mathcal{U}| = 100,000$, $\mathcal{N}_L = \emptyset$ and $\hat{\beta} = \beta = 0.2$. Bounds on rank
CDF were obtained via bootstrap. The depicted confidence
intervals are based on 200 repeated experiments.



Figure 3: The effect of $\hat{\beta}$ on estimated ROC curves, based on
2,000 known positives, 100,000
unlabeled instances and $\beta = 0.3$.


6.2 Model selection 


Often evaluation metrics are used to select the best model from a set of candidates. If model A’s
ROC (PR) curve dominates model B’s ROC (PR) curve, then for all $\hat{\beta}$ model A is better than model
B (leaving aside significance testing). However, in most cases one model does not dominate another
model and there exists a point where the two curves cross. Surprisingly, the ordering in terms of
both AUROC and AUPR is dependent on $\hat{\beta}$ when this happens. This means that the ordering of
models according to these metrics can switch when $\hat{\beta}$ changes. Figure 4 depicts an example that
illustrates this. This demonstrates that $\hat{\beta}$ can play a crucial role in model selection. In the likely
event that the curves cross, it is important to look at the range of possible values for $\hat{\beta}$ that represent
different operating conditions when selecting among different models.

A more formal explanation of why this occurs can be made based on partial derivatives of each entry
of the partial contingency table and TPR, FPR and precision based on unlabeled instances with
respect to $\hat{\beta}$:^4

$\frac{\partial \mathrm{TPR}}{\partial \hat{\beta}} = 0, \qquad \frac{\partial \mathrm{FPR}}{\partial \hat{\beta}} = \frac{|\mathrm{head}(\mathcal{U}, r)| - T(r) \cdot |\mathcal{U}|}{(1 - \hat{\beta})^2 \cdot |\mathcal{U}|}, \qquad \frac{\partial \mathrm{precision}}{\partial \hat{\beta}} = \frac{T(r) \cdot |\mathcal{U}|}{|\mathrm{head}(\mathcal{U}, r)|}$. (15)

The partial derivative of TPR is exactly 0 because our approach is based on rank CDFs (that is TPR at
each rank). Interestingly, the partial derivatives of FPR and precision with respect to $\hat{\beta}$ are dependent
on the value of the rank CDF $T(r)$ that is being used to infer surrogate positives. Since TPR is not a
function of $\hat{\beta}$ and the partial derivatives of FPR/precision with respect to $\hat{\beta}$ are functions of $T(r)$,
distinct segments of an ROC/PR curve are moved differently when $\hat{\beta}$ changes, inducing a non-uniform
scaling of AUC across the TPR range. Such scaling potentially changes the ordering of models based on AUC.
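These derivatives can be sanity-checked numerically. The sketch below builds the simplified partial contingency table for the unlabeled instances as described in Appendix B (rounding and clipping ignored) and returns both the metrics and the closed-form derivatives, so the two can be compared by finite differences. The function names and the example numbers are illustrative assumptions.

```python
from typing import Tuple


def metrics(beta_hat: float, t: float, head_u: float, n_u: float) -> Tuple[float, float, float]:
    """TPR, FPR and precision of the unlabeled partial contingency table."""
    tp = beta_hat * t * n_u                 # surrogate positives in the head
    fn = beta_hat * (1 - t) * n_u           # surrogate positives in the tail
    fp = head_u - tp                        # surrogate negatives in the head
    tn = (n_u - head_u) - fn                # surrogate negatives in the tail
    return tp / (tp + fn), fp / (fp + tn), tp / (tp + fp)


def analytic_derivatives(beta_hat: float, t: float, head_u: float,
                         n_u: float) -> Tuple[float, float, float]:
    """Closed-form derivatives of TPR, FPR and precision with respect to beta_hat."""
    d_fpr = (head_u - t * n_u) / ((1 - beta_hat) ** 2 * n_u)
    d_prec = t * n_u / head_u
    return 0.0, d_fpr, d_prec


# Hypothetical values: |U| = 1000, |head(U, r)| = 300, T(r) = 0.6, beta_hat = 0.2.
h = 1e-6
numeric = [(hi - lo) / (2 * h)
           for hi, lo in zip(metrics(0.2 + h, 0.6, 300, 1000),
                             metrics(0.2 - h, 0.6, 300, 1000))]
```

With these numbers the FPR derivative is negative, which matches the better-than-random case where a larger assumed fraction of positives lowers the FPR at a given rank.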


[Figure 4 panels: rank CDF of $\mathcal{P}_L$; ROC curves.]

Figure 4: The effect of $\hat{\beta}$ on ROC curves.
Setup: $|\mathcal{U}| = 45,000$, $|\mathcal{P}_L| = 5,000$.

Corresponding AUROC (best marked with *):

estimated $\hat{\beta}$    model 1    model 2
0.0                        72.5%      73.2%*
0.1                        75.5%*     74.7%

6.3 Empirical quality of the estimates 

We illustrate the quality of our estimated bounds on ROC and PR curves using a model trained in
a PU learning setting in [8] on the covtype data set [24]. The model was evaluated on a fully
labeled test set of 20,000 positive and 20,000 negative examples. To estimate performance, we
randomly selected 5% of positive examples to serve as our labeled set and treated all other examples
as unlabeled, which yields $|\mathcal{P}_L| = 1,000$, $|\mathcal{U}| = 39,000$ and $\beta \approx 49\%$. We present ROC and PR
curves with bounds for $\hat{\beta} = \beta$, $\hat{\beta} = 0$, and a confidence interval $\beta_{lo} = 0.8\beta \le \hat{\beta} \le \beta_{up} = 1.2\beta$.
Finally, as we have the ground truth, we present true curves as a reference.

^4 We made some simplifications; the details are described in Appendix B.
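The quoted set sizes and the value of the latent positive fraction follow from simple arithmetic on the test set described above:

```latex
|\mathcal{P}_L| = 0.05 \times 20{,}000 = 1{,}000, \qquad
|\mathcal{U}| = 40{,}000 - 1{,}000 = 39{,}000, \qquad
\beta = \frac{|\mathcal{P}_U|}{|\mathcal{U}|} = \frac{19{,}000}{39{,}000} \approx 0.487.
```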

Figure 5 presents the rank CDF and estimated bounds on ROC and PR curves. Figure 5(a) shows the
true rank CDF of $\mathcal{P}_U$ along with an estimated 95% CI on the rank CDF using $\mathcal{P}_L$ via a standard
bootstrap approach with 2,000 resamples. In this case, the CI contains the true rank CDF of latent
positives. Figures 5(b) and 5(c) show that the bounds closely approximate the true performance
curves. The estimated bounds are wider in PR space than in ROC space, particularly at low recall.
Note that estimated PR curves are sensitive to the estimation error in $\hat{\beta}$, as precision is directly
affected by class balance, limiting their usefulness if only a rough estimate of $\beta$ is available.



[Figure 5 panels: (b) ROC curves (x-axis: FPR); (c) PR curves (x-axis: Recall).]

Figure 5: Results for covtype showing rank CDF, ROC and PR curves, with $\beta \approx 49\%$.
Performance curve legend: true curve, $\hat{\beta} = 0$, $\hat{\beta} = \beta$ and $0.8\beta \le \hat{\beta} \le 1.2\beta$.


6.4 Relative importance of known negatives compared to known positives 

As our approach can incorporate known negatives, a natural question is how their presence influences
the estimates. In practice, a test set is of fixed size, so known negatives essentially reduce the size of
the unlabeled subset, which in turn reduces the number of degrees of freedom in assigning surrogate
positives. Using the same setup as in Subsection 6.3, we varied the proportion of known positives
and negatives and found known negatives provide some benefit, though this is small in practice.
However, our approach can also be reversed given a large amount of negatives, that is, flip known
class labels, use $\hat{\beta} = 1 - \beta$ and adjust the resulting contingency tables accordingly, which can
improve performance bounds. The benefits of known negatives are further discussed in Appendix C.

7 Conclusion 

We presented an approach to construct contingency tables corresponding to a lower and upper bound
on FPR using only partially labeled data, which enables computing many commonly used performance
metrics in a semi-supervised setting. Our approach relies on knowing the fraction of latent
positives in the unlabeled data, and we discussed its effect on determining the bounds and model
selection. We have seen that our approach can yield good estimates in practice.

Supplementary material for “Assessing Binary Classifiers Using Only Positive
and Unlabeled Data”

A Proofs 


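Lemma 1. Given a rank $r$ and two disjoint subsets of positives $\mathcal{P}_1$ and $\mathcal{P}_2$ within an overall ranking $\mathcal{R}$, if
$|\mathcal{P}_1| = |\mathcal{P}_2|$ and $\mathrm{TPR}(\mathcal{P}_1, r) > \mathrm{TPR}(\mathcal{P}_2, r)$, then $\mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$ (see Figure 6).

Figure 6: Illustration of Lemma 1: higher TPR at a given rank $r$ implies lower FPR at $r$ for two
positive sets of the same size.

Proof: The numerator of FPR is the number of false positives, i.e. the number of positive predictions minus
the number of true positives. Via Equations (2) and (3), these are $r$ and $\mathrm{TPR}(\mathcal{P}, r) \cdot |\mathcal{P}|$, respectively:

$\mathrm{FPR}(\mathcal{P}, r) = \frac{r - \mathrm{TPR}(\mathcal{P}, r) \cdot |\mathcal{P}|}{|\mathcal{R}| - |\mathcal{P}|}$. (16)

Since $|\mathcal{P}_1| = |\mathcal{P}_2|$, the denominators of $\mathrm{FPR}(\mathcal{P}_1, r)$ and $\mathrm{FPR}(\mathcal{P}_2, r)$ are equal, so
$\mathrm{TPR}(\mathcal{P}_1, r) > \mathrm{TPR}(\mathcal{P}_2, r) \Rightarrow \mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$. ■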

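Lemma 2. Given a rank $r$ and two disjoint sets of positives $\mathcal{P}_1$ and $\mathcal{P}_2$ in a ranking $\mathcal{R}$ and $\mathcal{P}_\Omega = \mathcal{P}_1 \cup \mathcal{P}_2$, if
$\mathrm{TPR}(\mathcal{P}_1, r) = t_1 < \mathrm{TPR}(\mathcal{P}_2, r) = t_2$ then $\mathrm{TPR}(\mathcal{P}_1, r) < \mathrm{TPR}(\mathcal{P}_\Omega, r) < \mathrm{TPR}(\mathcal{P}_2, r)$ (see Figure 7).

[Figure 7 axis labels: rank $r$; $\mathrm{TPR}(\mathcal{P}_1, r^*)$, $\mathrm{TPR}(\mathcal{P}_2, r^*)$.]

Figure 7: Illustration of Lemma 2 (the feasible region is marked in the figure). The rank distribution of the union
$\mathcal{P}_\Omega$ of two sets of positives $\mathcal{P}_1$ and $\mathcal{P}_2$ lies between their respective rank distributions.

Proof: write $\mathrm{TPR}(\mathcal{P}_\Omega, r)$ in terms of $t_1$ and $t_2$:

$\mathrm{TPR}(\mathcal{P}_\Omega, r) = \frac{t_1 \cdot |\mathcal{P}_1| + t_2 \cdot |\mathcal{P}_2|}{|\mathcal{P}_1| + |\mathcal{P}_2|}$, (17)

since $t_1 < t_2$, we get $t_1 < \mathrm{TPR}(\mathcal{P}_\Omega, r) < t_2$. ■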

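Lemma 3. Minimizing $\mathrm{TPR}(\mathcal{P}_U^*, r)$ in Equation (8) of Theorem 1 ensures $\mathrm{FPR}(\mathcal{P}_\Omega^*, r)$ is the greatest achievable
lower bound on $\mathrm{FPR}(\mathcal{P}_\Omega, r)$ given $\beta$, $T_{ub}(r)$, $\mathcal{R}$ and $\mathcal{U}$.

Proof (by contradiction): suppose another set of surrogate positives $\mathcal{P}_U' \subset \mathcal{U}$ exists with $|\mathcal{P}_U'| = \beta \cdot |\mathcal{U}|$, such
that $\mathcal{P}_U' \ne \mathcal{P}_U^*$, and $\mathrm{TPR}(\mathcal{P}_U', r) \ge T_{ub}(r)$ and for $\mathcal{P}_\Omega' = \mathcal{P}_L \cup \mathcal{P}_U'$:

$\mathrm{FPR}(\mathcal{P}_\Omega^*, r) < \mathrm{FPR}(\mathcal{P}_\Omega', r) \le \mathrm{FPR}(\mathcal{P}_\Omega, r)$.

Via Corollary 1 this implies $\mathrm{TPR}(\mathcal{P}_U', r) < \mathrm{TPR}(\mathcal{P}_U^*, r)$, which contradicts the definition of $\mathcal{P}_U^*$ (Eq. (8)). ■

Corollary 1. Given a rank $r$ and three sets of positives $\mathcal{P}_A$, $\mathcal{P}_B$ and $\mathcal{P}_C$ within a ranking $\mathcal{R}$ such that
$\mathcal{P}_A \cap \mathcal{P}_B = \emptyset$ and $\mathcal{P}_A \cap \mathcal{P}_C = \emptyset$ and $|\mathcal{P}_B| = |\mathcal{P}_C|$, then

$\mathrm{TPR}(\mathcal{P}_B, r) = t_B < \mathrm{TPR}(\mathcal{P}_C, r) = t_C \Leftrightarrow \mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_B, r) < \mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_C, r)$.

Proof: all terms are equal for $\mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_B, r)$ and $\mathrm{TPR}(\mathcal{P}_A \cup \mathcal{P}_C, r)$ except $t_B < t_C$ in Eq. (17).

Lemma 4. Given two sets of positive labels $\mathcal{P}_1$ and $\mathcal{P}_2$ within an overall ranking $\mathcal{R}$ and a rank $r$, such that
$\mathrm{TPR}(\mathcal{P}_1, r) = \mathrm{TPR}(\mathcal{P}_2, r) = t$ and $|\mathcal{P}_1| > |\mathcal{P}_2|$, then:

(a) $\mathrm{FPR}(\mathcal{P}_2, r) < t \Rightarrow \mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$,

(b) $\mathrm{FPR}(\mathcal{P}_2, r) > t \Rightarrow \mathrm{FPR}(\mathcal{P}_1, r) > \mathrm{FPR}(\mathcal{P}_2, r)$.

(a) corresponds to a ranking and cutoff that is better than random (i.e. $\mathrm{TPR}(\mathcal{P}, r) > \mathrm{FPR}(\mathcal{P}, r)$) whereas
(b) corresponds to a ranking and cutoff that is worse than random.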



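Figure 8: Illustration of Lemma 4, with $\mathcal{P}_A \subset \mathcal{R}$, $\mathcal{P}_B \subset \mathcal{R}$, $\mathcal{P}_C \subset \mathcal{R}$, $|\mathcal{P}_A| > |\mathcal{P}_C|$, $|\mathcal{P}_B| > |\mathcal{P}_C|$.
If two sets of positives $\mathcal{P}_1$ and $\mathcal{P}_2$ achieve a given TPR at the same rank $r$, e.g.
$\mathrm{TPR}(\mathcal{P}_1, r) = \mathrm{TPR}(\mathcal{P}_2, r)$ and $|\mathcal{P}_1| > |\mathcal{P}_2|$, then $\mathrm{FPR}(\mathcal{P}_1, r) < \mathrm{FPR}(\mathcal{P}_2, r)$ if $\mathrm{FPR}(\mathcal{P}_2, r) < \mathrm{TPR}(\mathcal{P}_2, r)$
and otherwise $\mathrm{FPR}(\mathcal{P}_1, r) > \mathrm{FPR}(\mathcal{P}_2, r)$.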


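Proof: take the derivative of FPR with respect to $|\mathcal{P}|$ while fixing $r$, based on Equation (16):

$$\frac{d\,\mathrm{FPR}(\mathcal{P}, r)}{d|\mathcal{P}|} = \frac{r - t \cdot |\mathcal{R}|}{(|\mathcal{R}| - |\mathcal{P}|)^2} = \frac{r - t \cdot |\mathcal{P}| - t \cdot |\mathcal{R} \setminus \mathcal{P}|}{(|\mathcal{R}| - |\mathcal{P}|)^2} \qquad (18)$$

$r - t \cdot |\mathcal{P}|$ is the number of negatives in the top ranking (false positives) and $t \cdot |\mathcal{R} \setminus \mathcal{P}|$ is the number of false
positives at FPR $= t$. The derivative is negative if the FPR is below $t$ and vice versa, therefore if the ranking
is better than random (TPR $= t >$ FPR), increasing $|\mathcal{P}|$ leads to a lower FPR at rank $r$ and vice versa. ■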
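The sign argument in the proof can be checked numerically. The sketch below assumes, as the proof's description of the numerator suggests, that FPR at rank $r$ equals $(r - t \cdot |\mathcal{P}|) / (|\mathcal{R}| - |\mathcal{P}|)$; the function names and all numbers are hypothetical and chosen only for illustration.

```python
def fpr(r, t, n_pos, n_total):
    """FPR at rank r: negatives in the top r divided by the total number of negatives.

    Assumes a fraction t of the n_pos positives is ranked in the top r,
    so the top r contains r - t * n_pos negatives.
    """
    return (r - t * n_pos) / (n_total - n_pos)


def dfpr_dpos(r, t, n_pos, n_total):
    """Closed-form derivative of FPR with respect to the number of positives."""
    return (r - t * n_total) / (n_total - n_pos) ** 2


n_total, r = 1000, 300  # hypothetical ranking size and cutoff rank

# Case (a): better than random, FPR below t, so a larger positive set lowers FPR.
t_good = 0.6
fpr_small_a = fpr(r, t_good, 100, n_total)
fpr_large_a = fpr(r, t_good, 200, n_total)

# Case (b): worse than random, FPR above t, so a larger positive set raises FPR.
t_bad = 0.1
fpr_small_b = fpr(r, t_bad, 100, n_total)
fpr_large_b = fpr(r, t_bad, 200, n_total)

# Central finite difference as a sanity check on the closed form.
eps = 1e-3
numeric = (fpr(r, t_good, 100 + eps, n_total) - fpr(r, t_good, 100 - eps, n_total)) / (2 * eps)

print(fpr_small_a, fpr_large_a, fpr_small_b, fpr_large_b)
```

The set sizes are treated as continuous quantities here, which matches the derivative-based argument but ignores that real set sizes are integers.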


B Effect of $\hat\beta$ on contingency table entries and common performance metrics

To study the effect of imprecise estimates of $\beta$, we start by computing partial derivatives of each entry of the
partial contingency table based on unlabeled instances with respect to $\hat\beta$ (see Section 4.1). Subsequently, we
compute partial derivatives of TPR, FPR and precision with respect to $\hat\beta$ to describe the effect of estimating $\beta$
on (area under) ROC and PR curves.

For ease of notation, we base all subsequent calculations on $\theta = \hat\beta\, T(r) \cdot |\mathcal{U}| \approx \hat\theta$, which ignores the discrete
effect of rounding in the real definition of $\hat\theta$ (Eq. 10). We additionally assume it is possible to assign the desired
amount $\theta$ of surrogate positives in $\mathrm{head}(\mathcal{U}, r)$, which holds for ranks $r$ that are not too close to the top or
bottom of $\mathcal{R}$, given reasonable values of $\hat\beta$ and CDF bounds $T(r)$.* If this does not hold, that is when there is
clipping in Eq. 11, then (small) changes in $\hat\beta$ do not affect $\mathrm{TP}^{\mathcal{U}}_r$, and hence the partial derivatives of all entries
in the contingency table with respect to $\hat\beta$ are effectively 0.

* $T(r)$ represents a bound on the rank CDF, that is either $T_{lb}(r)$ or $T_{ub}(r)$ as used in the manuscript.

Given these simplifications, the partial contingency table based on unlabeled instances becomes:

$$
\begin{aligned}
\mathrm{TP}^{\mathcal{U}}_r &= \theta = \hat\beta\, T(r) \cdot |\mathcal{U}|, \\
\mathrm{FN}^{\mathcal{U}}_r &= |\mathcal{P}_U| - \mathrm{TP}^{\mathcal{U}}_r = \hat\beta \cdot |\mathcal{U}| - \hat\beta\, T(r) \cdot |\mathcal{U}| = \hat\beta\,(1 - T(r)) \cdot |\mathcal{U}|, \\
\mathrm{FP}^{\mathcal{U}}_r &= |\mathrm{head}(\mathcal{U}, r)| - \mathrm{TP}^{\mathcal{U}}_r = |\mathrm{head}(\mathcal{U}, r)| - \hat\beta\, T(r) \cdot |\mathcal{U}|, \\
\mathrm{TN}^{\mathcal{U}}_r &= |\mathcal{U}| - |\mathcal{P}_U| - \mathrm{FP}^{\mathcal{U}}_r = |\mathcal{U}| - \hat\beta \cdot |\mathcal{U}| - |\mathrm{head}(\mathcal{U}, r)| + \hat\beta\, T(r) \cdot |\mathcal{U}| \\
&= (1 - \hat\beta + \hat\beta\, T(r)) \cdot |\mathcal{U}| - |\mathrm{head}(\mathcal{U}, r)|.
\end{aligned}
$$
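The four entries above can be written as a small function. This is a minimal sketch under the same simplifications (no rounding, no clipping); the function name, argument names and example numbers are hypothetical and not part of the paper.

```python
def partial_contingency_table(beta_hat, t_r, n_unlabeled, head_size):
    """Entries of the unlabeled-only contingency table at a single rank.

    beta_hat    : estimated fraction of positives in the unlabeled set
    t_r         : value of the rank CDF bound T(r), between 0 and 1
    n_unlabeled : number of unlabeled instances, |U|
    head_size   : number of unlabeled instances ranked in the top r, |head(U, r)|
    Rounding and clipping are ignored, as in the simplification in the text.
    """
    tp = beta_hat * t_r * n_unlabeled
    fn = beta_hat * (1 - t_r) * n_unlabeled
    fp = head_size - tp
    tn = (1 - beta_hat + beta_hat * t_r) * n_unlabeled - head_size
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical configuration: 1000 unlabeled instances, 300 of them in the top r.
table = partial_contingency_table(beta_hat=0.2, t_r=0.5, n_unlabeled=1000, head_size=300)
print(table)
```

Two quick consistency checks follow from the definitions: the entries sum to $|\mathcal{U}|$, and TP plus FP equals the size of the head, independent of $\hat\beta$.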


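The partial derivatives of each entry of the partial contingency table then become:

$$
\begin{aligned}
\frac{\partial\,\mathrm{TP}^{\mathcal{U}}_r}{\partial\hat\beta} &= T(r) \cdot |\mathcal{U}| \geq 0, &
\frac{\partial\,\mathrm{FP}^{\mathcal{U}}_r}{\partial\hat\beta} &= -T(r) \cdot |\mathcal{U}| \leq 0, \\
\frac{\partial\,\mathrm{FN}^{\mathcal{U}}_r}{\partial\hat\beta} &= (1 - T(r)) \cdot |\mathcal{U}| \geq 0, &
\frac{\partial\,\mathrm{TN}^{\mathcal{U}}_r}{\partial\hat\beta} &= (T(r) - 1) \cdot |\mathcal{U}| \leq 0.
\end{aligned}
$$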

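Partial derivatives for TPR, FPR and precision are a little more involved:

$$
\frac{\partial\,\mathrm{TPR}^{\mathcal{U}}_r}{\partial\hat\beta}
= \frac{T(r) \cdot |\mathcal{U}| \cdot \hat\beta\,|\mathcal{U}| - T(r)\,\hat\beta\,|\mathcal{U}| \cdot |\mathcal{U}|}{\hat\beta^2\,|\mathcal{U}|^2}
= \frac{T(r) - T(r)}{\hat\beta} = 0 \qquad (19)
$$

$$
\begin{aligned}
\frac{\partial\,\mathrm{FPR}^{\mathcal{U}}_r}{\partial\hat\beta}
&= \frac{\frac{\partial\,\mathrm{FP}^{\mathcal{U}}_r}{\partial\hat\beta} \cdot (|\mathcal{U}| - |\mathcal{P}_U|) - \mathrm{FP}^{\mathcal{U}}_r \cdot \frac{\partial (|\mathcal{U}| - |\mathcal{P}_U|)}{\partial\hat\beta}}{(|\mathcal{U}| - |\mathcal{P}_U|)^2}
= \frac{-T(r) \cdot |\mathcal{U}| \cdot (|\mathcal{U}| - |\mathcal{P}_U|) + \mathrm{FP}^{\mathcal{U}}_r \cdot |\mathcal{U}|}{(|\mathcal{U}| - |\mathcal{P}_U|)^2} \\
&= \frac{-T(r)\,(1 - \hat\beta)\,|\mathcal{U}|^2 + \mathrm{FP}^{\mathcal{U}}_r \cdot |\mathcal{U}|}{(1 - \hat\beta)^2 \cdot |\mathcal{U}|^2}
= \frac{-T(r)}{1 - \hat\beta} + \frac{(|\mathrm{head}(\mathcal{U}, r)| - \hat\beta\, T(r) \cdot |\mathcal{U}|) \cdot |\mathcal{U}|}{(1 - \hat\beta)^2 \cdot |\mathcal{U}|^2} \\
&= \frac{-T(r)}{(1 - \hat\beta)^2} + \frac{|\mathrm{head}(\mathcal{U}, r)|}{(1 - \hat\beta)^2 \cdot |\mathcal{U}|}
= \frac{|\mathrm{head}(\mathcal{U}, r)| - T(r) \cdot |\mathcal{U}|}{(1 - \hat\beta)^2 \cdot |\mathcal{U}|} \qquad (20)
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial\,\mathrm{PREC}^{\mathcal{U}}_r}{\partial\hat\beta}
&= \frac{\frac{\partial\,\mathrm{TP}^{\mathcal{U}}_r}{\partial\hat\beta} \cdot (\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r) - \mathrm{TP}^{\mathcal{U}}_r \cdot \frac{\partial (\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r)}{\partial\hat\beta}}{(\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r)^2}
= \frac{T(r) \cdot |\mathcal{U}| \cdot (\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r)}{(\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r)^2} \\
&= \frac{T(r) \cdot |\mathcal{U}|}{\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r}
= \frac{T(r) \cdot |\mathcal{U}|}{|\mathrm{head}(\mathcal{U}, r)|} \geq 0 \qquad (21)
\end{aligned}
$$

The second term in the numerator of the precision derivative vanishes because
$\mathrm{TP}^{\mathcal{U}}_r + \mathrm{FP}^{\mathcal{U}}_r = |\mathrm{head}(\mathcal{U}, r)|$ does not depend on $\hat\beta$.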


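Both $\partial\,\mathrm{FPR}^{\mathcal{U}}_r / \partial\hat\beta$ and $\partial\,\mathrm{PREC}^{\mathcal{U}}_r / \partial\hat\beta$ are a function of $T(r)$, while $\partial\,\mathrm{TPR}^{\mathcal{U}}_r / \partial\hat\beta = 0$. This implies that the
ordering of rankings in terms of area under the ROC curve can change when the estimate of $\beta$ changes, as
shown by example in Figure 4.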
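The closed-form sensitivities of TPR, FPR and precision derived above can be compared against finite differences. The sketch below is illustrative only: it assumes the simplified, unrounded contingency table, and the function names and numeric configuration are hypothetical.

```python
def rates(beta_hat, t_r, n_unlabeled, head_size):
    """TPR, FPR and precision of the unlabeled-only table (no rounding or clipping)."""
    tp = beta_hat * t_r * n_unlabeled
    fp = head_size - tp
    n_pos = beta_hat * n_unlabeled      # surrogate positives in the unlabeled set
    n_neg = n_unlabeled - n_pos         # remaining unlabeled instances
    return tp / n_pos, fp / n_neg, tp / (tp + fp)


def finite_difference(beta_hat, eps=1e-6, **config):
    """Central finite differences of (TPR, FPR, precision) with respect to beta_hat."""
    upper = rates(beta_hat + eps, **config)
    lower = rates(beta_hat - eps, **config)
    return tuple((u - l) / (2 * eps) for u, l in zip(upper, lower))


config = dict(t_r=0.5, n_unlabeled=1000, head_size=300)  # hypothetical values
beta_hat = 0.2

d_tpr, d_fpr, d_prec = finite_difference(beta_hat, **config)

# Closed-form derivatives for the same configuration.
expected_tpr = 0.0
expected_fpr = (config["head_size"] - config["t_r"] * config["n_unlabeled"]) / (
    (1 - beta_hat) ** 2 * config["n_unlabeled"]
)
expected_prec = config["t_r"] * config["n_unlabeled"] / config["head_size"]

print(d_tpr, d_fpr, d_prec)
```

Since the FPR sensitivity depends on both $T(r)$ and the head size, two rankings evaluated at the same $\hat\beta$ can shift by different amounts, which is consistent with the observation that their AUROC ordering may change.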


C The effect of the fraction of known positives, known negatives and $\hat\beta$

Known negatives can be incorporated in our approach as described in Section 4.1. Given a fixed ranking
$\mathcal{R}$, having known negatives essentially reduces the size of the unlabeled subset $\mathcal{U}$, which in turn reduces the
number of degrees of freedom in assigning surrogate positives. As such, known negatives provide some benefit,
though this is small in practice. Table 1 illustrates the effect of increasing amounts of known positives and
known negatives: known positives significantly tighten bounds on AUROC, while known negatives only do so
marginally (cf. bounds with 10% known positives and 40/60/80% known negatives).

However, when the number of known negatives is large, it may be useful to reverse our approach, i.e., start
from the rank distribution of known negatives. To do so, we can essentially flip all known class labels, use
$1 - \hat\beta$ in place of $\hat\beta$ and adjust the resulting contingency tables accordingly.

Table 2 shows bounds based on known positives or known negatives (whichever are tightest). It is
important to see that $|\mathcal{N}_L| > |\mathcal{P}_L|$ does not guarantee that performance bounds based on known negatives are
tighter, because $\hat\beta$ also affects the bounds. When computing performance bounds based on known negatives,
overestimating $\beta$ leads to underestimated bounds (since we use $1 - \hat\beta$) and vice versa. The effect of errors
in $\hat\beta$ is therefore opposite in bounds based on $\mathcal{N}_L$, relative to bounds based on $\mathcal{P}_L$.

Hence, bounds on performance metrics can be computed based primarily on known positives $\mathcal{P}_L$ or known
negatives $\mathcal{N}_L$. The width of the bounds depends on the combination of $|\mathcal{P}_L|$ (or $|\mathcal{N}_L|$) and $\hat\beta$ (or $1 - \hat\beta$) in a
nontrivial way: depending on $\hat\beta$, it is possible to obtain wider bounds based on known negatives, even if
$|\mathcal{N}_L| > |\mathcal{P}_L|$ (or vice versa). In practice, we can estimate metrics based on $\mathcal{P}_L$ and $\mathcal{N}_L$ separately and then use
whichever yields the tightest bounds, as shown in Table 2.

Configuration columns of Table 1 (all values in %):

                                          beta for known negatives =
known positives |P_L|/|P_Omega|           0     20    40    60    80
10                                        15    18    23    31    47
30                                        12    15    19    26    41
50                                        9     11    14    20    33
70                                        6     7     9     13    23

[Graphical columns: bounds on area under the ROC curve (true AUROC = 76.8%) for $\hat\beta/\beta$ = 0.8, 1.0 and 1.2,
drawn on an axis from 67% to 87%.]

Table 1: Estimated bounds on AUROC under different configurations. The total data set comprises
2,000 positives and 10,000 negatives. We varied the fraction of known positives and known negatives,
which also implies changing $\beta$. All entries in the table are in percentages. We used three
estimates for $\beta$, namely an underestimate, the correct value and an overestimate (left to right).

Legend: true AUROC; bounds based on known positives.
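The $\beta$ values in Table 1 follow directly from the data set sizes and the labeled fractions. The sketch below assumes, as in the abstract, that $\beta$ is the fraction of latent positives in the unlabeled set; the function name `beta` and the use of simple rounding are illustrative assumptions.

```python
def beta(n_pos, n_neg, frac_known_pos, frac_known_neg):
    """Fraction of latent positives among the unlabeled instances."""
    latent_pos = n_pos * (1 - frac_known_pos)
    latent_neg = n_neg * (1 - frac_known_neg)
    return latent_pos / (latent_pos + latent_neg)


known_pos = [0.1, 0.3, 0.5, 0.7]
known_neg = [0.0, 0.2, 0.4, 0.6, 0.8]

# beta in percent for a data set with 2,000 positives and 10,000 negatives.
beta_percent = {
    kp: [round(100 * beta(2000, 10000, kp, kn)) for kn in known_neg]
    for kp in known_pos
}

for kp, row in beta_percent.items():
    print(f"{int(kp * 100):>3}% known positives: {row}")
```

With rounding to the nearest integer, these values agree with the $\beta$ entries listed for Table 1.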
Configuration columns of Table 2 (all values in %):

                                          beta for known negatives =
known positives |P_L|/|P_Omega|           0     20    40    60    80
10                                        15    18    23    31    47
30                                        12    14    18    25    41
50                                        9     11    14    20    33
70                                        5     6     9     13    23

[Graphical columns: bounds on area under the ROC curve (true AUROC = 76.8%) for $\hat\beta/\beta$ = 0.8, 1.0 and 1.2,
drawn on an axis from 67% to 87%.]

Table 2: Estimated bounds on AUROC under different configurations. The total data set comprises
2,000 positives and 10,000 negatives. We varied the fraction of known positives and known negatives,
which also implies changing $\beta$. All entries in the table are in percentages. We used three
estimates for $\beta$, namely an underestimate, the correct value and an overestimate (left to right). In
this table, we computed bounds based on known positives and known negatives (separately) and
report the tightest confidence interval each time.

Legend: true AUROC; bounds based on known positives and known negatives.